Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZNS support #1298

Closed
wants to merge 11 commits into from
Closed

ZNS support #1298

wants to merge 11 commits into from

Conversation

MaisenbacherD
Copy link

This draft PR addresses the issue #1297 and shall encourage discussion.
This should not be merged as is as there are two dependencies that need to be resolved before this can potentially be merged:

@MaisenbacherD
Copy link
Author

Rebased this PR onto develop

@orville-wright
Copy link
Contributor

Hi @MaisenbacherD - I believe that SPDK v23.01 is now integrated into Mayastor.

@orville-wright
Copy link
Contributor

@tiagolobocastro can you confirm that SPDK v23.01 is now the current default base SPDK version in the product?

@orville-wright
Copy link
Contributor

Hi @MaisenbacherD I just chatted with @tiagolobocastro and he clarified that we will be updating to the latest release of SPDK soon, which is v24.01. and that we are already at v23.01 as of now.

@MaisenbacherD
Copy link
Author

Hi @orville-wright
Thanks for confirming!
This draft was rebased onto the develop branch when SPDK v23.01 was already picked up by you - so the SPDK version requirement is fulfilled :)

Let me know if there is something else I can do.

@orville-wright orville-wright added Enhancement New feature or request Ready for review Ready to be reviewed by team labels Feb 27, 2024
Copy link
Contributor

@tiagolobocastro tiagolobocastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't finished, will resume tomorrow

nr_seq_zones: u32,
max_active_zones: u32,
max_open_zones: u32,
) -> Result<u32, (i32, String)> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we return a new type Wrapper with RAII instead? example:

Suggested change
) -> Result<u32, (i32, String)> {
) -> Result<NullBlk, (i32, String)> {
struct NullBlk(u32);
impl Drop for NullBlk {
  fn drop(&mut self) {
    delete_nullblk_device(self.0);
  }
}

@@ -541,3 +542,92 @@ macro_rules! test_diag {
println!($($arg)*);
}}
}

pub fn create_zoned_nullblk_device(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some doc comments explaining what this nullblk is, for example, does it just ignore writes and return zeroes?

@@ -103,6 +103,7 @@ url = "2.2.2"
gettid = "0.1.2"
async-process = { version = "1.5.0" }
rstack = { version = "0.3.2" }
jemalloc-sys = "0.5.2+5.3.0-patched"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this used for? A patched version makes wonder if this is stable enough..

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for mentioning this!
I needed a way to allocate memory for the spdk_bdev_zone_info c struct. I am using calloc and free of this allocator crate.

Can you suggest a nicer way to do this? :)

@hrudaya21 Unfortunately, there is no convenient way to allocate this struct within SPDK.

SPDK_BDEV_ZONE_STATE_OFFLINE => Ok(SPDK_NVME_ZONE_STATE_OFFLINE),
SPDK_BDEV_ZONE_STATE_EXP_OPEN => Ok(SPDK_NVME_ZONE_STATE_EOPEN),
_ => {
error!("Can not map SPDK_BDEV_ZONE_STATE {} to any SPDK_NVME_ZONE_STATE", bdev_zone_state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
error!("Can not map SPDK_BDEV_ZONE_STATE {} to any SPDK_NVME_ZONE_STATE", bdev_zone_state);
error!("Can't map SPDK_BDEV_ZONE_STATE {bdev_zone_state} to any SPDK_NVME_ZONE_STATE");

error!("Can not map SPDK_BDEV_ZONE_STATE {} to any SPDK_NVME_ZONE_STATE", bdev_zone_state);
Err(CoreError::NvmeIoPassthruDispatch {
source: Errno::EINVAL,
opcode: 122,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does 122 mean? and 121 below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stands for Zone Management Receive and Zone Management Send respectively. I encoded those values into an enum.

@@ -82,6 +85,14 @@ pub trait BlockDevice {
&self,
listener: DeviceEventSink,
) -> Result<(), CoreError>;

fn is_zoned(&self) -> bool;
fn get_zone_size(&self) -> u64;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in rust getters, don't tend to have get_ prefix

bdev.remove_alias(self.url.as_ref());
let errno = unsafe {
bdev_nvme_delete(
self.name.clone().into_cstring().as_ptr(),
std::ptr::null(),
&mut path_id as *mut nvme_path_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be rebased

@@ -253,10 +266,20 @@ impl<'n> NexusBio<'n> {
if status == IoCompletionStatus::Success {
self.ctx_mut().successful += 1;
} else {
error!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have traces below? I'd prefer not to add too many traces in the IO path until we add a redirection. Currently traces are a blocking call so if stdout/err is not pulled fast enough it can block the task.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it makes sense. I will remove this line. :)


self.completion_error(child, status);
// Don't take zoned child out on zoned related nvme errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we want to take it out in this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zoned storage related status codes like TOO_MAY_OPEN_ZONES or ZONE_IS_FULL might be handled by the consuming application that is aware of the zoned storage interface.

Because no zoned replication is enabled yet, taking out the nexus is a bit harsh.
With zoned replication support later on we might want to take the nexus out again, depending on the logic of the replication.

) -> Result<Vec<(String, String)>, Error> {
let mut children_devices = Vec::new();
for uri in children {
let device_name = device_create(uri).await.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to handle errors here rather than unwrap

@MaisenbacherD
Copy link
Author

haven't finished, will resume tomorrow

@tiagolobocastro Thanks so far! I will address your comments soon after you are done :)

Copy link
Contributor

@tiagolobocastro tiagolobocastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsavitskiy @dsharma-dc @hrudaya21 please take a look as well

max_zone_append_size: u32,
max_open_zones: u32,
max_active_zones: u32,
optimal_open_zones: u32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have some doc comments here describing what these mean for the folks who are not very familiar with zones, such as myself :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add doc comments. Struct moves to spdk-rs/src/bdev.rs as I refactored the with_zoned bdev builder function to take the ZoneInfo and spdk-rs should not depend on mayastor.

Comment on lines 362 to 374
impl Default for ZoneInfo {
fn default() -> ZoneInfo{
ZoneInfo{
zoned: false,
num_zones: Default::default(),
zone_size: Default::default(),
max_zone_append_size: Default::default(),
max_open_zones: Default::default(),
max_active_zones: Default::default(),
optimal_open_zones: Default::default(),
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply add Default to the derive above

Suggested change
impl Default for ZoneInfo {
fn default() -> ZoneInfo{
ZoneInfo{
zoned: false,
num_zones: Default::default(),
zone_size: Default::default(),
max_zone_append_size: Default::default(),
max_open_zones: Default::default(),
max_active_zones: Default::default(),
optimal_open_zones: Default::default(),
}
}
}

@@ -387,6 +418,13 @@ impl<'n> Nexus<'n> {
.with_block_count(0)
.with_required_alignment(9)
.with_data(n)
.with_zoned(zone_info.zoned)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be tidier to add a with_zone method which takes the zone_info rather than adding all these here 1by1?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! Will change this :)

@@ -404,6 +442,7 @@ impl<'n> Nexus<'n> {
name.to_string(),
n.bdev.as_mut().unwrap(),
));
n.is_zoned = zone_info.zoned;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this not be set in the build()?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the is_zoned field as it was unused. Zoned information can be retrieved via the bdev.

let mut info_set = false;

for (_uri, device_name) in &*children_devices {
let dev = device_lookup(&device_name).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should return an error

Comment on lines 1395 to 1397
let mut zone_info = ZoneInfo::default();
let mut conventional = false;
let mut info_set = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there another name for non-zoned blk devices btw? (rather than conventional)

Suggested change
let mut zone_info = ZoneInfo::default();
let mut conventional = false;
let mut info_set = false;
// if we find non-zoned block devices.
let mut found_conventional = false;
let mut zone_info: Option<ZoneInfo> = None;

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, non-zoned blk devs are called "conventional". In the NVMe context, one could also label such a device as an NVM device (vs ZNS for zoned block devices), which refers to the I/O command set.


if dev.is_zoned() {
zone_info.zoned = true;
if !info_set {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if let Some(zone_info) = zone_info.as_mut() or match

}

if zone_info.zoned == conventional {
error!("Can not handle conventional and zoned storage at the same time in a nexus");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe include the nexus in the log too?

Suggested change
error!("Can not handle conventional and zoned storage at the same time in a nexus");
error!("{nexus:?} - can not handle conventional and zoned storage at the same time in a nexus");

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nexus does not exist at this point, but I can include the name of the requested nexus.

}

unsafe {
free(bdev_zone_info_c_void);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be done with RAII?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely yes. Let me wait on that with the discussion above about the jemalloc-sys crate. :)


let bdev_zone_info_c_void = unsafe { calloc(1, size_of_spdk_bdev_zone_info) };
loop {
if zone >= max_num_zones_to_report {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question I guess, should this be async?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Let me carry this change out when the discussion of the jamalloc-sys crate is resolved :)

@@ -814,6 +817,29 @@ impl MayastorEnvironment {
None
}

fn init_tracing(&self) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init_spdk_tracing perhaps.

for uri in children {
let device_name = device_create(uri).await.unwrap();
if device_lookup(&device_name).unwrap().is_zoned() && children.len() > 1 {
return Err(Error::ZonedReplicationNotImplemented {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are returning without creating all devices, don't we want to destroy back what we would've already created?


// Bit 07:00 Dword 13 > Zone Send Action
let zsa = zone_send_action_to_bdev_zone_action(nvme_cmd.cdw13 as u8).unwrap();

Copy link
Contributor

@hrudaya21 hrudaya21 Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid unwrap() here.

let bdev_zone_infos;

let ret = unsafe {
bdev_zone_infos = calloc(max_num_zones_to_report as usize, size_of_spdk_bdev_zone_info);
Copy link
Contributor

@hrudaya21 hrudaya21 Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to have this allcation inside the SPDK function, so that, if any failure happens, We can have a graceful exit here ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, there is no convenient way to allocate this struct within SPDK.

Can you suggest a nicer way to allocate memory for this struct? :)

if dev.is_zoned() {
//TODO: Implement partitioning zoned block devices. This requires handling drive resources like max active/open zones.
warn!("The device '{}' is zoned. Partitioning zoned block devices into smaller devices is not implemented. Using the whole device.", dev.device_name());
start_blk = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not work correctly because we have to reserve some space in the child device for metadata. That space calculation is done by calc_data_partition and start/end blocks are hence adjusted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! I was not aware that a metadata partition is necessary. I assumed that partitioning is just a feature that allows the creation of multiple Nexus from the same drive.

What kind of metadata needs space on the drive and where in the code can I find the IO requests being issued for those metadata operations? :)

Also, I am not sure if I fully understand what is happening in the match partition::calc_data_partition(self.req_size(), nb, bs) { code block. Does that mean the first self.req_size() bytes are reserved for metadata?

@@ -82,6 +85,14 @@ pub trait BlockDevice {
&self,
listener: DeviceEventSink,
) -> Result<(), CoreError>;

fn is_zoned(&self) -> bool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've the same suggestion.

@@ -62,6 +62,9 @@ pub trait BlockDevice {
/// Checks whether target I/O type is supported by the device.
fn io_type_supported(&self, io_type: IoType) -> bool;

/// Checks whether target I/O type is supported by the device.
fn io_type_supported_by_device(&self, io_type: IoType) -> bool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from io_type_supported()? Please describe the difference in the comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great question!
The io_type_supported_by_device still checks the supported io types of the drive.

For the ZNS feature, I had to implement NVMe IO passthrough support (see commit "feat(nexus): Add NVMe IO passthrough support") to enable Zone Management Send and Receive commands.

The io_type_supported returns always true for NvmeIo commands, because also zoned SpdkBlockDevices need to support the NvmeIo commands that are being issued to the zoned Nexus NVMe-oF target. Because SpdkBlockDevices do not natively support NvmeIo commands, the commit "feat(nexus): Emulate zone mgmt NvmeIo for SpdkBlockDevices" emulates this command.

The submit_io_passthru function takes the NvmeIo commands and either passes it on to the device if that command is natively supported or it checks if the NvmeIo.opc is emulated in software. Finally, the NvmeIo command gets rejected (errno::EOPNOTSUPP) if no emulation for that opcode is present.

I will adjust the docs of io_type_supported

Let me know if you have further questions or ideas on how to refactor this. :)

@@ -82,6 +85,14 @@ pub trait BlockDevice {
&self,
listener: DeviceEventSink,
) -> Result<(), CoreError>;

fn is_zoned(&self) -> bool;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add meaningful comments for all new methods, ideally with args and return value descriptions.

@@ -212,7 +223,51 @@ pub trait BlockDeviceHandle {
cb_arg: IoCompletionCallbackArg,
) -> Result<(), CoreError>;

/// TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add meaningful comments for all new methods, ideally with args and return value descriptions.

@@ -199,7 +199,7 @@ impl Default for MayastorCliArgs {
reactor_mask: "0x1".into(),
mem_size: 0,
rpc_address: "/var/tmp/mayastor.sock".to_string(),
no_pci: true,
no_pci: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, why this change is needed? Is it only for ZNS?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change would just be a 'nice to have'. It does not only target ZNS.
As part of the "feat(test): Enable PCIe device testing" commit, this enables testing on real PCIe devices. The idea is to later write some test cases that can run in a CI environment on real devices.

Let me know if I should drop this change :)

@@ -517,6 +556,10 @@ impl<'n> Nexus<'n> {
unsafe { self.bdev().required_alignment() }
}

pub fn is_zoned(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a doc comment for all new fields, methods, etc.

if dev.is_zoned() {
zone_info.zoned = true;
if !info_set {
zone_info.num_zones = dev.get_num_zones();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like adding a couple of helper methods to copy and compare zone info would make code much cleaner.

| IoType::Reset
| IoType::Unmap
| IoType::Flush => self.submit_all(),
IoType::ZoneAppend => {
warn!("ZoneAppend is explicitly disallowed, otherwise reading from different replicas won't work.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the nexus info in the warn! message.

pub fn is_zoned_nvme_error(status: IoCompletionStatus) -> bool {
match status {
IoCompletionStatus::NvmeError(NvmeStatus::CommandSpecific(cssc)) => match cssc {
CommandSpecificStatusCode::ZonedBoundaryError => true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pipe | operator would make this code clearer.

@@ -350,6 +350,23 @@ impl From<NvmeStatus> for IoCompletionStatus {
}
}

pub fn is_zoned_nvme_error(status: IoCompletionStatus) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a doc comment for all new fields, methods, etc.

zone_info.max_active_zones = dev.get_max_active_zones();
zone_info.optimal_open_zones = dev.get_optimal_open_zones();
info_set = true;
} else if zone_info.num_zones != dev.get_num_zones()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better implement or derive PartialEq trait for ZoneInfo

let mut children_devices = Vec::new();
for uri in children {
let device_name = device_create(uri).await.unwrap();
if device_lookup(&device_name).unwrap().is_zoned() && children.len() > 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid unwrap() here as well.

@@ -1371,6 +1480,14 @@ async fn nexus_create_internal(
return Ok(());
}

let mut children_devices = create_children_devices(children).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the normal child (not zoned device), whether this function will fail ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the normal case, the create_children_devices function is only failing if the device_create function returns an error.

@orville-wright
Copy link
Contributor

hi @MaisenbacherD it looks like all the requested reviewers have given significant feedback on the PR. So you should have lots of guidance on moving this project forward now. Let me know of you need any more assistance. Happy to help.

@MaisenbacherD
Copy link
Author

Thanks everyone for the reviews! I am addressing your comments in the following days :)

} else {
match orig_nvme_cmd.opc() {
// Zone Management Send
121 => return hdl.emulate_zone_mgmt_send_io_passthru(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to have opcode enum.

@@ -7,7 +7,7 @@ use futures::{future::Future, FutureExt};

mod nexus_bdev;
mod nexus_bdev_children;
mod nexus_bdev_error;
pub mod nexus_bdev_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be pub(crate) and maybe at other places?

self.ns.is_zoned()
}

fn get_zone_size(&self) -> u64 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to use get_ prefix?


stdenv.mkDerivation rec {
pname = "btrfs-progs";
version = "5.14.1";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not pin the version? Also we use 5.6.x for btrfs operations on csi, so maybe it should be at least be in par with that.

@tiagolobocastro tiagolobocastro removed the Ready for review Ready to be reviewed by team label May 1, 2024
@orville-wright
Copy link
Contributor

hi @MaisenbacherD - checking in here on where things are on the project?

  • let's talk to see how we can move this forward

Move children device creation out from `Nexus::new_child` into a helper
function. Create the children devices before the Nexus is created and
add them after the Nexus's initialization.

This is in preperation for zoned storge support, as we need to inspect
the children device features before creating a Nexus such that we can
decide whether the Nexus must be zoned or not.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Adding setter and getter for zone storage related fields to the Nexus and
BlockDevice.

Inspect the children before creating the Nexus to decide whether the Nexus
needs to be zoned or not. The Nexus inherrits the zoned related fields
form its child.

For now a zoned Nexus does not allow replication.

Partitioning of a zoned Nexus is also not supported for now.
The user will optain the whole zoned Nexus if requested.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Passthrough NvmeIo if the child (NVMe device) supports it. This is
necessary for the Zoned Storage support, as the Zone Management Send and
Receive commands are issued to the Nexus as NvmeIo.

Those commands are issued when zones are reported or the state of a zone
is actively changed.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Zoned block devices that do not support NvmeIo (zoned uring) need to
get the zone management send and receive NvmeIo translated that is beeing
issued on the Nexus.

Those commands are issued when zones are reported or the state of a zone
is actively changed.

Inspect fields of the incoming NVMe command and fill the buffer
according to the 'NVM Express Zoned Namespace Command Set Specification'
(see https://nvmexpress.org/developers/nvme-command-set-specifications/)
with the help of the SPDK zoned bdev wrapper functions.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Testing zoned storage support through zoned uring with a backing zoned
nullblk device.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Zoned related CommandSpecificStatusCodes's should not take out the
nexus.

Signed-off-by: Dennis Maisenbacher <[email protected]>
Copy link
Author

@MaisenbacherD MaisenbacherD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, everyone for the helpful reviews, and my apologies for the late reply!
I hope that I addressed all comments with either a discussion or a direct code change.
There are some more open discussions to resolve :)

Also, I will address the btrfs nix package comments and post a fix for the Failed to write to /dev/nvme-fabrics: Invalid argument on a more recent kernel when I am back from vacation in two weeks. :)

Please also have a look at the PR for spdk-rs that is required by this series: openebs/spdk-rs#24

@@ -199,7 +199,7 @@ impl Default for MayastorCliArgs {
reactor_mask: "0x1".into(),
mem_size: 0,
rpc_address: "/var/tmp/mayastor.sock".to_string(),
no_pci: true,
no_pci: false,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change would just be a 'nice to have'. It does not only target ZNS.
As part of the "feat(test): Enable PCIe device testing" commit, this enables testing on real PCIe devices. The idea is to later write some test cases that can run in a CI environment on real devices.

Let me know if I should drop this change :)

@@ -1371,6 +1480,14 @@ async fn nexus_create_internal(
return Ok(());
}

let mut children_devices = create_children_devices(children).await?;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the normal case, the create_children_devices function is only failing if the device_create function returns an error.

@@ -83,7 +174,7 @@ impl SpdkBlockDevice {
pub fn lookup_by_name(name: &str) -> Option<Box<dyn BlockDevice>> {
debug!("Searching SPDK devices for '{}'...", name);
let bdev = UntypedBdev::lookup_by_name(name)?;
debug!("SPDK {} device found: '{}'", bdev.driver(), name);
debug!("SPDK {} device found: '{}'", bdev.driver(), bdev.name());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no difference between name and bdev.name().

This change stuck from debugging and was meant to clarify that the name of the found bdev is supposed to be printed out and match with the debug statement above.

I can revert this change if you want :)

@@ -387,6 +418,13 @@ impl<'n> Nexus<'n> {
.with_block_count(0)
.with_required_alignment(9)
.with_data(n)
.with_zoned(zone_info.zoned)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! Will change this :)

@@ -253,6 +256,7 @@ pub struct Nexus<'n> {
pub(super) nexus_target: Option<NexusTarget>,
/// Indicates if the Nexus has an I/O device.
pub(super) has_io_device: bool,
pub(super) is_zoned: bool,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove this field as it was unused. Zoned information can be retrieved via the bdev.

}

unsafe {
free(bdev_zone_info_c_void);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely yes. Let me wait on that with the discussion above about the jemalloc-sys crate. :)


let bdev_zone_info_c_void = unsafe { calloc(1, size_of_spdk_bdev_zone_info) };
loop {
if zone >= max_num_zones_to_report {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Let me carry this change out when the discussion of the jamalloc-sys crate is resolved :)


let mut result;
loop {
result = unsafe {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. Let me carry this change out when the discussion of the jamalloc-sys crate is resolved :)

@@ -253,10 +266,20 @@ impl<'n> NexusBio<'n> {
if status == IoCompletionStatus::Success {
self.ctx_mut().successful += 1;
} else {
error!(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it makes sense. I will remove this line. :)


self.completion_error(child, status);
// Don't take zoned child out on zoned related nvme errors
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zoned storage related status codes like TOO_MAY_OPEN_ZONES or ZONE_IS_FULL might be handled by the consuming application that is aware of the zoned storage interface.

Because no zoned replication is enabled yet, taking out the nexus is a bit harsh.
With zoned replication support later on we might want to take the nexus out again, depending on the logic of the replication.

@orville-wright
Copy link
Contributor

orville-wright commented Jun 12, 2024

Thanks @MaisenbacherD -
We are very grateful that you have taken the time and effort to enhance openEBS with such an innovative feature. Your contribution to the project & the community is very valuable and very appreciated.
It makes openEBS even better for everyone.

We also appreciate that an industry respected storage innovator (WD.com) supports openEBS with their resources & your time/expertise. Thank you WD.com

The team will review and pass on comments. Great progress.

@orville-wright orville-wright added Selected for development Accepted into roadmap for product integration External Contrib High priority Community contrib code labels Jun 16, 2024
@MaisenbacherD
Copy link
Author

Hi @orville-wright,

Apart from the review comments, would it be possible to get some help with enabling the latest btrfs, libnvme and nvme-cli nix packages? I am not that fluent in nix yet. :)

@orville-wright
Copy link
Contributor

Sure thing. I'll loop get you connected with the right people on the team.

@orville-wright
Copy link
Contributor

orville-wright commented Aug 17, 2024

Hey @tiagolobocastro @niladrih and @Abhinandan-Purkait - Dennis @MaisenbacherD needs some eng help with some of the packages... specifically btrfs, libnvme and nvme-cli nix.
He's said that he's not that fluent in nix yet.

  • Since this is an important community / partner led feature... can you guys sync up with Dennis and let's see if we can help him out.
  • We really need to get this awesome ZNS feature coded & released.

~Dave

@Abhinandan-Purkait
Copy link
Member

Sure. FYI, btrfs support has already been added to mayastor, We just need to specify fstype: btrfs in storage class to have brtfs filesystem volumes backed by mayastor.

@orville-wright
Copy link
Contributor

Hi @MaisenbacherD looks like some feedback was given by @Abhinandan-Purkait , but I'm not sure if he provided enough of the info you needed ?

@MaisenbacherD
Copy link
Author

@orville-wright, @Abhinandan-Purkait great! I was not aware that btrfs is already an option :)

The current libnvme and nvme-cli nix packages in use look outdated. I would need help to get the latest nix upstream packages configured. I am not sure that adjusting e.g. nix/pkgs/nvmet-cli/default.nix is the correct way to do that :)

@tiagolobocastro
Copy link
Contributor

@orville-wright, @Abhinandan-Purkait great! I was not aware that btrfs is already an option :)

The current libnvme and nvme-cli nix packages in use look outdated. I would need help to get the latest nix upstream packages configured. I am not sure that adjusting e.g. nix/pkgs/nvmet-cli/default.nix is the correct way to do that :)

To be honest, you can probably just remove the specific files we've got for those, and just rely on the versions from nixpkgs.
Otherwise you can edit those files to pull the latest versions you need.

@tiagolobocastro
Copy link
Contributor

@MaisenbacherD sorry for short notice, would you be able to join our community meeting today: https://us05web.zoom.us/j/87535654586?pwd=CigbXigJPn38USc6Vuzt7qSVFoO79X.1
It's at 2pm UTC

@MaisenbacherD
Copy link
Author

Hi @tiagolobocastro. Yes, I will join :) Talk to you soon!

@tiagolobocastro
Copy link
Contributor

tiagolobocastro commented Oct 8, 2024

Should we close this PR for now?

I guess we forgot to add an update here @MaisenbacherD
We discussed in the community meeting and then offline and we came to the conclusion that perhaps this is not the best way to start adding ZNS to mayastor because adding multiple ZNS children to the nexus, handling IO retries etc would become quite a complicated or perhaps not even possible.
Instead Dennis proposed he'd start looking at CSAL as another approach for bringing ZNS to Mayastor.

It's been a while, so I hope I'm not misremembering it @MaisenbacherD?

(btw besides the nexus changes, there are some code changes here which would be interesting to take as it adds value by itself)

@MaisenbacherD
Copy link
Author

@tiagolobocastro Thanks for remembering to post our discussions here :)

I am closing this PR draft now.
I just came back from conferences and vacation and I will start working on the CSAL patches in the next days.

Also, I will make sure to bring the ZNS-unrelated commits of this PR to the CSAL patchset :)

Thanks, everyone for the in-depth discussions on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement New feature or request External Contrib High priority Community contrib code PR Pending Ready for review Ready to be reviewed by team reviewing Selected for development Accepted into roadmap for product integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants